Prior 96w evaluation of TurboCapture-Seq v2 showed low UMIs recovered and high seq saturation. I did low throughput troubleshooting and didn’t see any issues. Process this experiment myself taking care to remove residual liquids from wash steps.
Process Hui Shi of Naik lab Bcor + Flt3 timecourse. Includes a few of my samples.
Samples
SCE object in generate in 1A_generateSCE_reads notebook. The samples were then clustered in the 2B Clustering Wt notebook
Compare different subsets sorted from dendritic cells at 5 and 7 days. The easiest contrasts to interpret are the difference from the common dendritic cell precursor (CDP).
Workflow is from:
https://bioconductor.org/packages/devel/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html#differential-expression-analysis
This was generated in notebook 2B.
sce <- readRDS(here::here(
"data/TIRE_dendritic_mouse/SCEs", "DCs_cluster.sce.rds"))
dge <- scran::convertTo(sce, type="edgeR")
dge_orig <- dgeHave a look at the important metadata.
There are not enough replicates for the preDC contrast to be
meaningful.
tb <- dge$samples[,c("Cell_type", "Cell_number", "Ligand", "Timepoint_Day")]
tb %>%
dplyr::count(Cell_type, Ligand, Timepoint_Day)The cell type is the major difference between samples with the
timepoint being much less so.
Write the day as text.
pca_tb <- as_tibble(reducedDim(sce))
pca_tb$Cell_Type <- sce$Cell_type
pca_tb$Timepoint <- sce$Timepoint_Day
pca_tb$Timepoint <- as.factor(pca_tb$Timepoint)
plt2 <- ggplot(pca_tb, aes(x=PC1, y=PC2, colour=Cell_Type, label=Timepoint)) +
geom_text(size=5) +
xlab("PC1 (34%)") + ylab("PC2 (28%)") +
scale_colour_brewer(palette = "Dark2") +
theme_Publication()
plt2Interested in only the total DC cultures and cDC2s. Remove the other cell types.
Doing this reduces the multiple testing burden and fits variation better. After half the genes are removed leaving 10,000.
## [1] 21484 15
keep.exprs <- filterByExpr(dge, group=dge$samples$Cell_type, min.count=1)
dge <- dge[keep.exprs,, keep.lib.sizes=FALSE]
dim(dge)## [1] 9619 15
Correct for composition biases by computing normalization factors with the trimmed mean of M-values method.
Looking at the PCA and deciding on a subset to focus on cDC2 seems like a good choice:
Day 5:
Expansion of DC progenitors and early differentiation into immature DCs.
Presence of a heterogeneous mix of progenitor cells and early DC
subsets. Limited functional capacity.
Day 7:
Peak numbers of mature DCs with distinct cDC1, cDC2, and pDC
populations. Enhanced antigen presentation and cytokine production
abilities. Optimal time point for harvesting DCs for functional
assays.
Set up a spline based analysis
Use a cubic regression spline curve with 3 degrees of freedom.
dge$samples$Timepoint_Day <- as.numeric(as.character(dge$samples$Timepoint_Day))
splines <- ns(dge$samples$Timepoint_Day, df = 3)
design <- model.matrix(~splines, dge$samples)
head(design)## (Intercept) splines1 splines2 splines3
## ACAGTAGCTC 1 0.00000000 0.0000000 0.0000000
## AGACGCATCG 1 -0.04015528 0.3459654 -0.2631210
## ATCAGCGCGA 1 -0.11108086 0.4651511 0.6459298
## CCGTATGTAG 1 0.00000000 0.0000000 0.0000000
## CGTCAATAGT 1 0.00000000 0.0000000 0.0000000
## CGTCACTCTA 1 -0.11108086 0.4651511 0.6459298
The three coefficients do not have any particular meaning. Hypothesis testing would only make sense if the three coefficients are assessed together. The advantage of using a cubic spline curve is that it provides more stable fit at the end points compared to a polynomial.
The spline curve with 3 degrees of freedonm has 2 knots where cubic polynomials are splined together. In general, choosing a number of degrees of freedom to be in range of 3-5 is reasonable.
Circle back to see why BCV has no dots.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001107 0.013019 0.014557 0.013520 0.015816 0.021353
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9699 0.9916 1.0414 1.0314 1.0581 1.7072
In a time course experiment, we are looking for genes that change
expression level over time. Here, the design matrix uses 3 natural
spline basis vectors to model smooth changes over time, without assuming
any particular pattern to the trend.
We test for a trend by conducting F-tests on 3 df for each gene:
The topTags function lists the top set of genes with most significant time effects.
The total number of genes with significant (5% FDR) changes at different doses can be examined with decideTests.
There are 3200 differentially expressed genes that change over the 7 days.
## splines3-splines2-splines1
## NotSig 6426
## Sig 3193
Note that all three spline coefficients should be tested together in
this way. It is not meaningful to replace the F-tests with t-tests for
the individual coefficients, and similarly the logFC columns of the top
table do not have any interpretable meaning.
The trends should instead be interpreted by way of trend plots, as we
show now.
Finally, we visualize the fitted spline curves for the top four genes.
We start by computing the observed and fitted log-CPM values for each
gene:
top_genes <- head(results$Symbol,4)
logCPM.obs <- as.data.frame(cpm(dge, log=TRUE, prior.count=fit$prior.count))
logCPM.obs$ID <- row.names(logCPM.obs)
logCPM.fit <- as.data.frame(cpm(fit, log=TRUE))
logCPM.fit$ID <- row.names(logCPM.fit)
# Mung the cpm columns into a single tibble
lcm.obs <- pivot_longer(logCPM.obs[top_genes,], cols = -ID, names_to = "sample", values_to = "count") %>%
rename(obs_cpm = count)
lcm.fit <- pivot_longer(logCPM.fit[top_genes,], cols = -ID, names_to = "sample", values_to = "count") %>%
rename(fit_cpm = count)
lcm.obs$fit_cpm <- lcm.fit$fit_cpm
# Add back the sample metadata
lcm <- left_join(lcm.obs, results[,c("ID", "Symbol")])
lcm <- left_join(lcm, dge$samples[,c("Well_BC", "Timepoint_Day")], by=c("sample" = "Well_BC"))The hits are all MHCII genes unsurprisingly.
The CD74 gene encodes a protein known as the HLA class II histocompatibility antigen gamma chain, also referred to as the invariant chain (Ii) or CD74. This protein is crucial for the proper functioning of the immune system, particularly in antigen presentation by Major Histocompatibility Complex (MHC) class II molecules.
Any gene that begins with H2- in mice is part of the H-2 complex, the murine equivalent of the human MHC (called HLA). These genes encode proteins involved in the immune system’s ability to recognize self from non-self.
Roles of H2- Genes:
ggplot(lcm) +
geom_point(aes(y = obs_cpm, x=Timepoint_Day), color = "black") +
geom_smooth(aes(y = fit_cpm, x=Timepoint_Day), method = "loess", se = FALSE, color = "red") +
xlab("Time (days)") + ylab("logCPM") +
theme_Publication(base_size = 22) + theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 1)) +
facet_wrap(~ ID, scales = "free_y", labeller = labeller(ID = label_value)) +
theme(strip.text = element_text(face = "bold.italic"))The results make sense to me. Will confirm with the domain experts.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Red Hat Enterprise Linux 9.4 (Plow)
##
## Matrix products: default
## BLAS: /stornext/System/data/software/rhel/9/base/tools/R/4.4.1/lib64/R/lib/libRblas.so
## LAPACK: /stornext/System/data/software/rhel/9/base/tools/R/4.4.1/lib64/R/lib/libRlapack.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid splines stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] ggthemes_5.1.0 here_1.0.1
## [3] org.Mm.eg.db_3.19.1 AnnotationDbi_1.66.0
## [5] ggvenn_0.1.10 knitr_1.48
## [7] patchwork_1.3.0 edgeR_4.2.2
## [9] limma_3.60.6 scater_1.32.1
## [11] scran_1.32.0 scuttle_1.14.0
## [13] lubridate_1.9.3 forcats_1.0.0
## [15] stringr_1.5.1 dplyr_1.1.4
## [17] purrr_1.0.2 readr_2.1.5
## [19] tidyr_1.3.1 tibble_3.2.1
## [21] ggplot2_3.5.1 tidyverse_2.0.0
## [23] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
## [25] Biobase_2.64.0 GenomicRanges_1.56.2
## [27] GenomeInfoDb_1.40.1 IRanges_2.38.1
## [29] S4Vectors_0.42.1 BiocGenerics_0.50.0
## [31] MatrixGenerics_1.16.0 matrixStats_1.4.1
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 gridExtra_2.3
## [3] rlang_1.1.4 magrittr_2.0.3
## [5] RSQLite_2.3.7 compiler_4.4.1
## [7] mgcv_1.9-1 DelayedMatrixStats_1.26.0
## [9] png_0.1-8 vctrs_0.6.5
## [11] pkgconfig_2.0.3 crayon_1.5.3
## [13] fastmap_1.2.0 XVector_0.44.0
## [15] labeling_0.4.3 utf8_1.2.4
## [17] rmarkdown_2.28 tzdb_0.4.0
## [19] UCSC.utils_1.0.0 ggbeeswarm_0.7.2
## [21] bit_4.5.0 xfun_0.48
## [23] bluster_1.14.0 zlibbioc_1.50.0
## [25] cachem_1.1.0 beachmat_2.20.0
## [27] jsonlite_1.8.9 blob_1.2.4
## [29] highr_0.11 DelayedArray_0.30.1
## [31] BiocParallel_1.38.0 irlba_2.3.5.1
## [33] parallel_4.4.1 cluster_2.1.6
## [35] R6_2.5.1 RColorBrewer_1.1-3
## [37] bslib_0.8.0 stringi_1.8.4
## [39] jquerylib_0.1.4 Rcpp_1.0.13
## [41] Matrix_1.7-0 igraph_2.0.3
## [43] timechange_0.3.0 tidyselect_1.2.1
## [45] rstudioapi_0.17.0 abind_1.4-8
## [47] yaml_2.3.10 viridis_0.6.5
## [49] codetools_0.2-20 lattice_0.22-6
## [51] KEGGREST_1.44.1 withr_3.0.1
## [53] evaluate_1.0.1 Biostrings_2.72.1
## [55] pillar_1.9.0 generics_0.1.3
## [57] rprojroot_2.0.4 hms_1.1.3
## [59] sparseMatrixStats_1.16.0 munsell_0.5.1
## [61] scales_1.3.0 glue_1.8.0
## [63] metapod_1.12.0 tools_4.4.1
## [65] BiocNeighbors_1.22.0 ScaledMatrix_1.12.0
## [67] locfit_1.5-9.10 colorspace_2.1-1
## [69] nlme_3.1-164 GenomeInfoDbData_1.2.12
## [71] beeswarm_0.4.0 BiocSingular_1.20.0
## [73] vipor_0.4.7 cli_3.6.3
## [75] rsvd_1.0.5 fansi_1.0.6
## [77] S4Arrays_1.4.1 viridisLite_0.4.2
## [79] gtable_0.3.5 sass_0.4.9
## [81] digest_0.6.37 SparseArray_1.4.8
## [83] ggrepel_0.9.6 dqrng_0.4.1
## [85] farver_2.1.2 memoise_2.0.1
## [87] htmltools_0.5.8.1 lifecycle_1.0.4
## [89] httr_1.4.7 statmod_1.5.0
## [91] bit64_4.5.2